fix: issue#915, Error for large integers in Series #1233

Sohaib90 · 2023-01-05T16:20:03Z

This patch fixes issue #915: Error for series with large integers.

The issue occurs when numpy.histogram is called on a series of large integers, which produces unevenly spaced bin edges. The affected calls are here:

https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L39

and here:

https://github.com/ydataai/pandas-profiling/blob/5b1abac48ed9ed5a9e7e662be30c913acc3e7a5b/src/pandas_profiling/model/summary_algorithms.py#L52

This can cause the resulting histogram to be distorted or misleading, as the bin sizes may not be uniform.

To resolve this issue, I use the numpy.histogram_bin_edges function to compute the bin edges for the data before passing them to the numpy.histogram function. This function lets you specify the number of bins and the range of the data, and computes the bin edges so that they are evenly spaced. With this fix, the error from the bug report is no longer raised and the report is generated successfully. I have also added test_issue915.py to test the generation of the report.
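As a rough illustration of the two-step approach described above (not the actual patch), the edges can be computed first and then handed to `np.histogram`. The helper name `histogram_with_explicit_edges` and its defaults are hypothetical; only `np.histogram_bin_edges` and `np.histogram` come from NumPy.

```python
import numpy as np


def histogram_with_explicit_edges(values, bins_arg=10, weights=None):
    """Hypothetical helper: compute bin edges first, then bin the data."""
    values = np.asarray(values)
    # Step 1: derive the bin edges from the data and the requested bin count.
    edges = np.histogram_bin_edges(values, bins=bins_arg)
    # Step 2: count the values using those explicit edges.
    counts, edges = np.histogram(values, bins=edges, weights=weights)
    return counts, edges


# Small integers: every bin has the same width and the same count.
counts, edges = histogram_with_explicit_edges(np.arange(100), bins_arg=5)
print(counts)
print(np.diff(edges))
```

For small values like these the edges are evenly spaced; this sketch alone says nothing about how the approach behaves for very large integers.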

Sohaib90 · 2023-01-23T19:41:12Z

@alexbarros you might want to look into this as well if you get the chance :)

alexbarros · 2023-01-24T13:49:41Z

src/pandas_profiling/model/summary_algorithms.py

+    bins = np.histogram_bin_edges(finite_values, bins=bins_arg)
+    stats[name] = np.histogram(finite_values, bins=bins, weights=weights)


interesting solution, but still seems to behave a bit weirdly with big numbers

> import numpy as np
> arr = np.array([716277643516076032 + i for i in range(100)])
> bins = np.histogram_bin_edges(arr, bins=5)
> np.histogram(arr, bins=bins)
(array([ 0, 0, 65, 0, 35]), array([7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17, 7.16277644e+17]))
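One plausible explanation for the output above, offered as an assumption rather than a confirmed diagnosis, is float64 precision: the bin edges are floats, and near 7.16e17 adjacent float64 values are 128 apart, so 100 consecutive integers cannot stay distinct once they are converted. A short check of that precision limit:

```python
import numpy as np

# Same values as in the snippet above, with an explicit 64-bit integer dtype.
arr = np.array([716277643516076032 + i for i in range(100)], dtype=np.int64)

# Convert to float64, the dtype of the computed bin edges.
as_float = arr.astype(np.float64)

# Distance from the first value to the next representable float64.
gap = np.spacing(as_float[0])
print(gap)  # 128.0

# The integers are all distinct, but only two distinct floats remain.
print(np.unique(arr).size)       # 100
print(np.unique(as_float).size)  # 2
```

Two surviving float values would be consistent with the two non-empty bins (65 and 35) in the result above.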

True, it is still not evenly distributed the way it is for smaller numbers. What do you propose here? Leaving out np.histogram_bin_edges raises an error for larger numbers. Is it better to raise an error than to have weird behavior?

Thinking from a user's perspective, for me it is better to have an error raised than an incorrect plot. If I know that there was a problem with the large integers, I can preprocess that column and run again, but an incorrect result may lead me to an incorrect interpretation of my data distribution.

Yeah, that is what I was thinking as well. I should change it so that it raises an error rather than producing an incorrect plot, right?

Also, the Codacy Static Code Analysis check is failing. I think that is a new one.
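The "raise an error rather than plot incorrectly" option discussed above could look roughly like the sketch below. This is not code from the PR: the function name `check_histogram_precision`, the constant `MAX_EXACT_INT`, and the choice of 2**53 as the threshold (the point beyond which float64 can no longer represent every integer exactly) are all assumptions made for illustration.

```python
import numpy as np

# Assumed threshold: above 2**53, float64 cannot represent every integer.
MAX_EXACT_INT = 2**53


def check_histogram_precision(values):
    """Hypothetical guard: fail loudly instead of binning imprecise values."""
    values = np.asarray(values)
    if values.size > 0 and np.issubdtype(values.dtype, np.integer):
        # Use Python ints so the absolute value cannot overflow.
        largest = max(abs(int(values.min())), abs(int(values.max())))
        if largest > MAX_EXACT_INT:
            raise ValueError(
                "Integers exceed float64 precision; the histogram may be "
                "distorted. Consider rescaling or offsetting the column."
            )
    return values


check_histogram_precision(np.arange(100))  # passes silently

try:
    check_histogram_precision(
        np.array([716277643516076032, 716277643516076033], dtype=np.int64)
    )
except ValueError as exc:
    print(f"Rejected: {exc}")
```

A magnitude-only check like this is conservative: it would also reject large integers spread over a wide range, which may histogram without trouble. Comparing the bin width against `np.spacing` of the largest value would be a less strict alternative.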


codecov-commenter · 2023-01-24T14:14:34Z

Codecov Report

Base: 90.47% // Head: 90.49% // Increases project coverage by +0.01% 🎉

Coverage data is based on head (01c2677) compared to base (d685678).
Patch coverage: 91.30% of modified lines in pull request are covered.


Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #1233      +/-   ##
===========================================
+ Coverage    90.47%   90.49%   +0.01%     
===========================================
  Files          185      186       +1     
  Lines         5661     5681      +20     
===========================================
+ Hits          5122     5141      +19     
- Misses         539      540       +1

Flag	Coverage Δ
py3.8-ubuntu-latest-pandas	`90.49% <91.30%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
src/pandas_profiling/model/summary_algorithms.py	`74.41% <66.66%> (-0.29%)`	⬇️
tests/issues/test_issue915.py	`100.00% <100.00%> (ø)`


Sohaib90 changed the title ~~Issue#915: Fixed the Error for large integers in Series~~ fix: Issue#915, Error for large integers in Series Jan 5, 2023

Sohaib90 changed the base branch from master to develop January 20, 2023 13:54

Sohaib90 force-pushed the issue915 branch from ac171fd to 63a4ef1 Compare January 20, 2023 14:20

Sohaib90 changed the title ~~fix: Issue#915, Error for large integers in Series~~ fix: issue#915, Error for large integers in Series Jan 24, 2023

Sohaib90 added 3 commits January 24, 2023 10:57

fix: issue#915 error for large integers

0fabbf7

fix: issue#915 added histogram_bin_egdes with max_bins

56e102a

fix: issue#915 lint issues

21d8042

Sohaib90 force-pushed the issue915 branch from 770e5f3 to 21d8042 Compare January 24, 2023 10:02

Sohaib90 and others added 4 commits January 24, 2023 11:04

Merge branch 'develop' into issue915

3da095e

Merge branch 'develop' into issue915

628493e

fix: issue#915 flake8 errors

78cfb30

fix: issue#915 bin args

e689665

aquemy requested review from aquemy and alexbarros January 24, 2023 13:38

alexbarros reviewed Jan 24, 2023

src/pandas_profiling/model/summary_algorithms.py (outdated)

fix: issue#915 rename var bin_args

6a5409e

Sohaib90 force-pushed the issue915 branch from 5a7c611 to 6a5409e Compare January 24, 2023 14:01

Merge branch 'develop' into issue915

0f66735

fabclmnt requested a review from alexbarros January 25, 2023 03:20

Merge branch 'develop' into issue915

01c2677

vascoalramos force-pushed the develop branch 2 times, most recently from 6ba2217 to ef023c3 Compare January 30, 2023 17:43

aquemy force-pushed the develop branch from 79856bc to b722b70 Compare March 8, 2023 13:25

aquemy force-pushed the develop branch from 9777b85 to 40fb0c2 Compare May 24, 2023 07:57

aquemy force-pushed the develop branch 2 times, most recently from 4500563 to cfb020d Compare June 21, 2023 12:39

aquemy force-pushed the develop branch from 6a3342a to 8f4f622 Compare October 10, 2023 10:25
